Introduction

Baseball is America's greatest pastime and one of the most storied sports in our history. But what if I told you there was a hidden league of some of the greatest baseball players we have never heard of? That league was the NLB, Negro league baseball, a collection of smaller leagues of Negro league players during segregation in America. These leagues started around the 1920s and ended around the 1950s, when many of these players were finally accepted into the MLB. I will explore some of these forgotten players and some of the statistics that were hidden for many years.

Data

https://github.com/fivethirtyeight/negro-leagues-player-ratings

The GitHub repository above holds the dataset. This analysis will tell the story and stats of many forgotten baseball stars.

Barrier to entry:

Negro league: 150 games as a batter or 60 games + starts as a pitcher

MLB: 300 games as a batter or 350 games + starts as a pitcher

The MLB players include both current players and Hall of Fame players

The data comes from FiveThirtyEight, and some of it was sourced from https://www.seamheads.com/NegroLgs/, a collection of NLB statistics.

The goal of our analysis

Libraries

library(tidyverse) # data manipulation; also attaches dplyr, ggplot2 and readr
library(dplyr)
library(ggplot2) # visualization
library(plotly) # interactive graphs

# install.packages("plotly") # run once if plotly is not installed

Our Table

library(readr)
RawNLBandMLB <- read_csv("negro-leagues-player-ratings.csv")

glimpse(RawNLBandMLB)
## Rows: 1,117
## Columns: 25
## $ playerID     <chr> "culbech01", "gosseph01", "herrmch01", "kratzer01", "pire…
## $ commonName   <chr> "Charlie Culberson", "Phil Gosselin", "Chris Herrmann", "…
## $ league       <chr> "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "…
## $ hof          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ startYear    <dbl> 2012, 2013, 2012, 2010, 2014, 2015, 2011, 2014, 2008, 201…
## $ endYear      <dbl> 2020, 2020, 2019, 2020, 2019, 2019, 2019, 2019, 2019, 202…
## $ totalGames   <dbl> 428, 359, 370, 335, 302, 326, 461, 419, 386, 313, 376, 48…
## $ positionWar  <dbl> -0.620, 0.895, -1.150, 1.715, 0.545, 1.310, -1.555, 4.340…
## $ averageHit   <dbl> 41.791451, 72.992105, 3.648244, 21.236047, 67.574190, 10.…
## $ patience     <dbl> 13.776205, 28.641438, 70.106180, 19.112442, 18.976314, 24…
## $ power        <dbl> 41.709774, 16.879935, 44.105636, 69.670569, 37.244759, 9.…
## $ speed        <dbl> 64.524912, 58.562483, 75.850803, 1.334059, 78.872856, 81.…
## $ defense      <dbl> 24.25810, 44.89518, 36.48244, 99.59161, 38.95998, 90.4982…
## $ gameCutoff   <dbl> 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 30…
## $ playerLabel  <chr> "Active Player", "Active Player", "Active Player", "Activ…
## $ shortWar     <dbl> -0.2346729, 0.4038719, -0.5035135, 0.8293433, 0.2923510, …
## $ positionCat  <chr> "Outfielder", "Middle IF", "Catcher", "Catcher", "Middle …
## $ position     <chr> "Batter", "Batter", "Batter", "Batter", "Batter", "Batter…
## $ careerStarts <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strikeOuts   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ control      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ fip          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ whip         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ era          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ fact         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Variables

Filtering

There are a lot of variables in this dataset that we don't need within the scope of our analysis, so we're going to select only the variables we need. We then split the data into an NLB set and an MLB set to use as needed.

# Keep only the columns used in the rest of the analysis
NLBandMLB <- RawNLBandMLB %>%
  select(playerID, commonName, league, hof, startYear, endYear, totalGames,
         positionWar, averageHit, defense, gameCutoff, playerLabel, shortWar,
         positionCat, position, era)

NLB <- NLBandMLB %>% filter(league == 'NLB')

MLB <- NLBandMLB %>% filter(league == 'MLB')

glimpse(NLBandMLB)
## Rows: 1,117
## Columns: 16
## $ playerID    <chr> "culbech01", "gosseph01", "herrmch01", "kratzer01", "pirel…
## $ commonName  <chr> "Charlie Culberson", "Phil Gosselin", "Chris Herrmann", "E…
## $ league      <chr> "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "M…
## $ hof         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ startYear   <dbl> 2012, 2013, 2012, 2010, 2014, 2015, 2011, 2014, 2008, 2018…
## $ endYear     <dbl> 2020, 2020, 2019, 2020, 2019, 2019, 2019, 2019, 2019, 2020…
## $ totalGames  <dbl> 428, 359, 370, 335, 302, 326, 461, 419, 386, 313, 376, 489…
## $ positionWar <dbl> -0.620, 0.895, -1.150, 1.715, 0.545, 1.310, -1.555, 4.340,…
## $ averageHit  <dbl> 41.791451, 72.992105, 3.648244, 21.236047, 67.574190, 10.8…
## $ defense     <dbl> 24.25810, 44.89518, 36.48244, 99.59161, 38.95998, 90.49823…
## $ gameCutoff  <dbl> 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300…
## $ playerLabel <chr> "Active Player", "Active Player", "Active Player", "Active…
## $ shortWar    <dbl> -0.2346729, 0.4038719, -0.5035135, 0.8293433, 0.2923510, 0…
## $ positionCat <chr> "Outfielder", "Middle IF", "Catcher", "Catcher", "Middle I…
## $ position    <chr> "Batter", "Batter", "Batter", "Batter", "Batter", "Batter"…
## $ era         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

This data is a lot more condensed and ready for our analysis.

Could NLB players even compete in the majors?

First, let's see whether any players have statistics from both the MLB and the NLB, so we can check whether their stats are similar in both leagues.

# Finding the stats for players that have entries in both the MLB and NLB tables
# distinct() keeps one row per name so the second join does not duplicate rows
duplicatedData <- inner_join(x = NLB, y = MLB, by = "commonName") %>%
  select(commonName) %>%
  distinct()

PlayersInBothLeagues <- inner_join(NLBandMLB, duplicatedData, by = "commonName")

ggplot(PlayersInBothLeagues, aes(y = shortWar, x = commonName, fill = league)) +
  geom_bar(stat = 'identity', position = "dodge")

Only 3 players in this dataset had stats in both the MLB and the NLB, so let's compare their shortWar across leagues. The biggest difference is in Sam Crawford's stats, where shortWar jumps almost 4 points. The story these stats don't tell is that the NLB entry is a pitcher and the MLB entry an outfielder, which moves WAR a lot and makes this case hard to compare. Because the join matches on name only, it is also worth confirming that both rows refer to the same person. For Roy and Monte, we can see that Roy was almost exactly the same level of player in the MLB as he was in the NLB, while Monte dropped from a near-MVP player in the NLB to an All-Star in the MLB. While this sample is very small, it suggests that the competition in the two leagues was fairly comparable.
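Since the join above matches on commonName alone, two different people who share a name would be merged into one "player". The following is a minimal sketch of a sanity check, assuming only the startYear and endYear columns shown earlier. The rows in `toy` are hypothetical and are not taken from the dataset. It lines up each matched name's career span in both leagues so that an implausible gap, in either direction, stands out.

```r
library(dplyr)

# Hypothetical toy rows standing in for NLBandMLB; values are illustrative only.
toy <- tibble(
  commonName = c("Player A", "Player A", "Player B", "Player C"),
  league     = c("NLB", "MLB", "NLB", "MLB"),
  startYear  = c(1940, 1948, 1925, 1990),
  endYear    = c(1947, 1956, 1935, 2001)
)

# Keep only names that occur in both leagues.
both <- toy %>%
  group_by(commonName) %>%
  filter(n_distinct(league) == 2) %>%
  ungroup()

# One row per name, with the gap in years between the two careers.
career_gap <- both %>%
  group_by(commonName) %>%
  summarise(
    nlb_end   = max(endYear[league == "NLB"]),
    mlb_start = min(startYear[league == "MLB"]),
    gap_years = mlb_start - nlb_end,
    .groups   = "drop"
  )

career_gap
```

A gap of a year or two is what a real move between leagues would look like; a gap of decades suggests two different people with the same name.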

Is the distribution of WAR similar across both leagues?

# creating a box plot to show the distribution of WAR across leagues 
ggplot(NLBandMLB, mapping = aes(league, shortWar, fill = league)) +
  geom_boxplot()

While this is a very broad question, the plot shows that Negro league players were on average a little worse than MLB players, but there may be a hidden reason why the NLB box sits lower.

Who were the very best in the NLB and how do they compare to the MLB?
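The box plot can be backed up with numbers. This is a minimal sketch of a per-league summary of shortWar, run on hypothetical toy values rather than the real data; the column names `q1_war`, `med_war` and `q3_war` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy values; not taken from the dataset.
toy <- tibble(
  league   = c("NLB", "NLB", "NLB", "MLB", "MLB", "MLB"),
  shortWar = c(-1, 2, 6, 0, 3, 4)
)

# Quartiles per league, the same quantities a box plot draws.
war_summary <- toy %>%
  group_by(league) %>%
  summarise(
    players = n(),
    q1_war  = quantile(shortWar, 0.25, names = FALSE),
    med_war = median(shortWar),
    q3_war  = quantile(shortWar, 0.75, names = FALSE),
    .groups = "drop"
  )

war_summary
```

Swapping `toy` for NLBandMLB would give the medians and quartiles behind each box.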

ggplot(NLBandMLB, aes(shortWar, positionWar, color = league)) +
  geom_point()

This graph plots every player in the dataset, colored by league. positionWar is noticeably better for the MLB players, but positionWar accumulates with games played, and the top MLB players have over 2,000 games played while the NLB players cap out at around half of that. The spread across shortWar is fairly even, so what interests me most is the number of NLB players who are negative in both positionWar and shortWar. The reason is the same one that pulled the NLB box downward in the box plot above: the barrier to entry for this dataset is much lower for the NLB. The dataset is not a complete view of every MLB player. Many MLB players also have negative career WAR, but they did not play enough games to be counted here, whereas some NLB players who were in and out of the league fairly quickly still played long enough to qualify.

Let's look at the player all the way in the bottom left, Percy Forrest. Percy was a pitcher and an outfielder. Across his 6 seasons as a pro he averaged only 6 games a season, out of an NLB season of 81 games. His WAR stats show a poor player who did not play much, but he stuck around the league long enough to play the total number of games needed to qualify for this dataset. By comparison, MLB players need around 300 games to qualify, and someone who lasted 300 games is much less likely to be as bad as Percy was.
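The games-per-season figure can be estimated from columns already in the table. This is a rough sketch, assuming a player appeared in every season from startYear through endYear, which will understate the rate for anyone who skipped seasons. The rows are hypothetical, and `seasons` and `gamesPerSeason` are names introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy rows; values are illustrative only.
toy <- tibble(
  commonName = c("Player A", "Player B"),
  startYear  = c(1930, 2010),
  endYear    = c(1935, 2019),
  totalGames = c(60, 1500)
)

# Seasons spanned (inclusive) and average games played per season.
per_season <- toy %>%
  mutate(
    seasons        = endYear - startYear + 1,
    gamesPerSeason = totalGames / seasons
  )

per_season
```

Sorting the real table by `gamesPerSeason` would surface the players who qualified through longevity rather than regular playing time.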

Who are the superstars in the NLB?

NLBinteractive <- plot_ly(NLB, x = ~shortWar, y = ~positionWar, type = 'scatter', mode = 'markers',
        text = ~paste('Name ', commonName))

NLBinteractive

With this graph we can see some of the best players in the Negro leagues. We see names like Josh Gibson, Dobie Moore and Charlie Smith, who put up shortWar stats similar to those of Babe Ruth, widely regarded as the best baseball player ever, and yet these are names many of us have never heard. These 3 players have shortWars over 10, making them some of the best baseball players of all time. It's interesting to look over these names and see the amazing stats they put up through their careers.

Where would the NLB batters rank all time?

NLBandMLB %>% select(commonName, league, averageHit, shortWar, hof) %>% arrange(desc(averageHit)) %>% slice_head(n=20)
## # A tibble: 20 × 5
##    commonName       league averageHit shortWar   hof
##    <chr>            <chr>       <dbl>    <dbl> <dbl>
##  1 Ty Cobb          MLB         100       8.02     1
##  2 Charlie Smith    NLB         100      10.3      0
##  3 Nap Lajoie       MLB          99.9     7.09     1
##  4 Ed Delahanty     MLB          99.9     7.97     1
##  5 Ted Williams     MLB          99.9     8.92     1
##  6 Rogers Hornsby   MLB          99.9     9.23     1
##  7 Tris Speaker     MLB          99.8     7.70     1
##  8 Rod Carew        MLB          99.8     5.05     1
##  9 Tony Gwynn       MLB          99.7     4.45     1
## 10 Josh Gibson      NLB          99.7    10.9      1
## 11 George Sisler    MLB          99.7     4.18     1
## 12 Wade Boggs       MLB          99.6     5.96     1
## 13 Honus Wagner     MLB          99.6     8.25     1
## 14 Stan Musial      MLB          99.6     6.83     1
## 15 Harry Heilmann   MLB          99.5     5.31     1
## 16 Roberto Clemente MLB          99.5     5.84     1
## 17 Eddie Collins    MLB          99.4     7.00     1
## 18 Heavy Johnson    NLB          99.4     5.42     0
## 19 Jose Altuve      MLB          99.4     4.50     0
## 20 Babe Ruth        MLB          99.3    11.1      1

First let's look at the batting statistics for players in both leagues; remember that these batting statistics are percentiles. The 100th percentile holds the best batters of all time. One is Ty Cobb, a very famous player regarded as one of the best ever. Right under him is Charlie Smith, a Negro league player who is also in the 100th percentile of batters and whom many people have never heard of. Comparing just these two, Charlie has a shortWar about 2 points better than Ty's. While there is not an abundance of NLB players in this list, we also see Josh Gibson in the 99th percentile, and for both of these NLB players the only player in the top 20 batters with a higher shortWar is Babe Ruth, which makes their statistics an amazing feat.
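For readers unfamiliar with percentile ratings, here is a minimal sketch of how one could be derived from a raw stat using dplyr's `percent_rank()`. The dataset's own method is not described here and may differ, and the `battingAvg` values below are hypothetical.

```r
library(dplyr)

# Hypothetical raw batting averages; not taken from the dataset.
toy <- tibble(
  commonName = c("A", "B", "C", "D", "E"),
  battingAvg = c(0.250, 0.300, 0.275, 0.350, 0.225)
)

# percent_rank() rescales ranks to [0, 1]; multiply by 100 for a percentile.
ranked <- toy %>%
  mutate(pct = 100 * percent_rank(battingAvg)) %>%
  arrange(desc(pct))

ranked
```

Under this scheme the best hitter in the pool always lands at 100 and the worst at 0, so a percentile says where a player sits relative to the other players in the table, not how large the raw gap is.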

ggplot(NLBandMLB, aes(averageHit, fill = league)) +
  geom_histogram(binwidth = 10) +
  facet_grid(~league)

This graph shows the number of players in each percentile bin for batting, split by league. In the top bin, around the 95th percentile and up, there are around 20 NLB players. While that is not a huge number, it is still a significant number of players who have been forgotten through history. When we speak about Babe Ruth, Ty Cobb, and Hank Aaron, there are 20 NLB names we could add to that conversation.

Well, who was pitching to these batters?

This is a great question to ask, because the batting stats of the Negro league players would not be as impactful if the pitchers in the Negro leagues were bad, so let's look and see.
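Reading a count off a histogram bar is approximate, so the top of the distribution can also be counted directly. This is a minimal sketch on hypothetical toy rows; the cutoff of 95 is an assumption chosen to mirror the top bin discussed above.

```r
library(dplyr)

# Hypothetical toy rows; values are illustrative only.
toy <- tibble(
  league     = c("NLB", "NLB", "NLB", "MLB", "MLB"),
  averageHit = c(99.7, 96.0, 50.2, 99.9, 80.1)
)

# Players at or above the 95th percentile, counted per league.
top_counts <- toy %>%
  filter(averageHit >= 95) %>%
  count(league)

top_counts
```

Running the same two steps on NLBandMLB would replace "around 20" with an exact number.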

# Count players above the 90th percentile of ERA, by player category and position
NLBandMLB %>%
  select(commonName, era, shortWar, league, hof, playerLabel, position) %>%
  filter(era > 90) %>%
  ggplot() +
  geom_bar(aes(playerLabel, fill = position)) +
  ggtitle("ERA 90th percentile")

In this graph we can see the number of players above the 90th percentile of ERA in each of the 3 main categories of our dataset. Around 20 Negro league pitchers fall above the 90th percentile, showing that there were very good pitchers in the Negro leagues, which makes the stats of Charlie Smith and Josh Gibson just as impressive as those of Babe Ruth and Ty Cobb. We also see a large number of active players above the 90th percentile. The reason is a long story, but it is a combination of things: a lot of great pitchers have entered the league, there have been advances in analytics and in how to pitch, there has been a lot of cheating to improve things like spin rates, and there have been rule changes that favor pitchers. The MLB is currently changing rules and cracking down on cheating, which would lower many of these statistics for the players active in the MLB right now.

How did their WAR compare to their counterparts?
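One detail of the filter above is worth spelling out: batters have NA in the era column, and `filter()` keeps only rows where the condition is TRUE, so they are dropped without any explicit position filter. This is a small sketch on hypothetical rows; "Other Label" is a made-up category used only for illustration.

```r
library(dplyr)

# Hypothetical toy rows; "Other Label" is a made-up category.
toy <- tibble(
  playerLabel = c("Active Player", "Active Player", "Other Label", "Other Label"),
  position    = c("Pitcher", "Batter", "Pitcher", "Pitcher"),
  era         = c(95, NA, 92, 40)
)

# NA > 90 evaluates to NA, which filter() treats as "do not keep".
top_era <- toy %>% filter(era > 90)

# Count the remaining players per category and position.
era_counts <- top_era %>% count(playerLabel, position)

era_counts
```

The same counts are what the bar chart draws, so printing them is a quick way to get exact bar heights.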

ggplot(NLBandMLB, aes(era, shortWar, color = league)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Era and War of pitchers")

The graph above shows the shortWar of the pitchers against their ERA percentile. On average, from the 50th percentile up, the shortWar of NLB pitchers was better than that of their MLB counterparts. While I don't think all Negro league pitchers were better than MLB pitchers, this graph shows that there were very good pitchers in the Negro leagues, which makes the batting stats of many Negro league players even more impressive.

Who were these players?

NLBinteractivePitchers <- plot_ly(NLB, x = ~era, y = ~shortWar, type = 'scatter', mode = 'markers',
        text = ~paste('Name: ', commonName))

NLBinteractivePitchers

This graph gives the names of many of the greatest pitchers in the Negro leagues, players like Satchel Paige, Jose Leblanc, and Martin Dihigo: 3 pitchers who should be part of the conversation about the greatest pitchers in the history of the game.

Summary

After comparing the data, I believe it was right for the MLB to recognize these players and add their stats to the MLB record, as they faced very similar competition, and many who came out of the NLB were able to produce at similar if not higher levels in the MLB than they did in the NLB. I think it's important that we push the story of these Negro league players and shine a light on a league that was left in the darkness for many years.